Thai National Corpus: A Progress Report

Authors

  • Wirote Aroonmanakun
  • Kachen Tansiri
  • Pairit Nittayanuparp
Abstract

This paper presents problems and solutions in developing the Thai National Corpus (TNC). The TNC is designed to be comparable to the British National Corpus. The project aims to collect eighty million words. Since 2006, the project has collected only fourteen million words. The data is accessible from the TNC Web. The delay in creating the TNC is caused mainly by the need to obtain authorization for copyrighted texts. Methods used for collecting data and their results are discussed. Errors that arise while encoding the data, and how to handle them, are described.

1 Thai National Corpus

The Thai National Corpus (TNC) is a general corpus of the standard Thai language (Aroonmanakun, 2007). It is designed to be comparable to the British National Corpus (Aston and Burnard, 1998) in terms of its domain and medium proportions. However, only written texts are collected in the TNC, and the corpus size is targeted at eighty million words.

In addition to the domain and medium criteria, texts are also selected and categorized on the basis of their genres. We adopted Lee’s idea of categorizing texts into genres based on external factors such as the purpose of communication, the participants, and the setting of communication (Lee, 2001). Texts in the same genre share the same characteristics of language use, e.g. discourse structure, sentence patterns, etc.

Moreover, since the TNC is meant to represent the standard Thai language at present, 90% of the texts will be texts produced from 1998 onward. The remaining 10% can be texts produced before 1998 if they have been published recently. The structure of the TNC is therefore shaped along the dimensions of domain, medium, genre and time (see Table 1).

Texts that fit the designed proportions of these criteria are selected. The copyright holder of each text is then contacted and asked to sign a permission form. To make this process easier, the same form is used for all copyright holders.
When authorization is granted, a sample is drawn at random from the beginning, the middle, or the end of the text, or from several sections. Sample size can vary, but it will not exceed 40,000 words, or about 80 A4 pages.

In the TNC project, we use the TEI guidelines, “TEI P4”, for markup. Three types of information are marked in a document: documentation of encoded data, primary data, and linguistic annotation. Documentation of encoded data is the markup used for contextual information about the text. Primary data refers to the basic elements in the text, such as paragraphs, sections, sentences, etc. Linguistic annotation is the markup used for linguistic analysis, such as parts of speech, sentence structures, etc. The first two types are the minimum requirements for marking up texts.

Each document is structured in two parts: a header that holds the markup for contextual information, and the body text, which holds the markup for primary data and for linguistic analysis. For linguistic annotation, we mark word boundaries and a transcription for every word. Part-of-speech information will not be marked at present. The following is an example of markup in a document (only the text content of the example survives here): ก 3 ก ก

We recognize that marking tags manually is a difficult and time-consuming task, so two programs are used in this project for tagging language data and contextual information. TNC Tagger is used for segmenting words and marking basic tags in the text. The word segmentation and transcription program proposed by Aroonmanakun and Rivepiboon (2004) serves as the tagger. TNC Header is used for entering contextual information and generating the header tag for each text. The output from TNC Tagger is then combined with the header tag to form an XML document.
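The sampling rule and the final assembly step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TNC Tagger or TNC Header programs themselves: the element names `document`, `teiHeader`, `text`, `p` and `w`, the `tran` attribute, and the header field names are hypothetical placeholders, and the input is assumed to be already segmented into words with transcriptions.

```python
import random
import xml.etree.ElementTree as ET

MAX_SAMPLE_WORDS = 40_000  # upper bound on sample size stated in the text


def sample_words(words, max_words=MAX_SAMPLE_WORDS, rng=None):
    """Return one contiguous sample of at most max_words from a segmented text.

    The sample position (beginning, middle, or end) is chosen at random.
    Sampling from several sections is not covered by this sketch.
    """
    rng = rng or random.Random()
    if len(words) <= max_words:
        return list(words)
    position = rng.choice(["beginning", "middle", "end"])
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (len(words) - max_words) // 2
    else:
        start = len(words) - max_words
    return list(words[start:start + max_words])


def build_document(header_fields, paragraphs):
    """Combine contextual information and segmented text into one XML tree.

    header_fields: dict of contextual information; keys are hypothetical
                   field names and must be valid XML element names.
    paragraphs:    list of paragraphs, each a list of (word, transcription).
    """
    root = ET.Element("document")  # hypothetical root element name
    header = ET.SubElement(root, "teiHeader")
    for name, value in header_fields.items():
        ET.SubElement(header, name).text = value
    body = ET.SubElement(root, "text")
    for paragraph in paragraphs:
        p = ET.SubElement(body, "p")
        for word, transcription in paragraph:
            # "tran" is a hypothetical attribute holding the transcription
            w = ET.SubElement(p, "w", {"tran": transcription})
            w.text = word
    return root


if __name__ == "__main__":
    doc = build_document(
        {"title": "Sample text", "genre": "fiction"},
        [[("word1", "w-1"), ("word2", "w-2")]],
    )
    print(ET.tostring(doc, encoding="unicode"))
```

Keeping the header and the body as separate inputs mirrors the division of labour between the two programs: contextual information and tagged text can be produced independently and merged only at the end.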

Similar resources

The Semantics of khɨn3 and loŋ1 in Thai Compared to up and down in English: A Corpus-based Study

This corpus-based study analyzes meanings of khɨn3 ‘ascend’ and loŋ1 ‘descend’ in Thai in comparison with up and down in English. Data came from three corpora: the Thai National Corpus (TNC) (Aroonmanakun et al., 2009), the British National Corpus (BNC), and the English-Thai Parallel Concordance (Aroonmanakun, 2009). Results of the analyses show that there are senses of the vertical spatial ter...

Building A Large Thai Text Corpus - Part-Of-Speech Tagged Corpus: ORCHID -

This paper presents a procedure in building a Thai part-of-speech (POS) tagged corpus named ORCHID. It is a collaboration project between Communications Research Laboratory (CRL) of Japan and National Electronics and Computer Technology Center (NECTEC) of Thailand. We proposed a new tagset based on the previous research on Thai parts-of-speech for using in a multi-lingual machine translation pr...

ORCHID: Thai Part-Of-Speech Tagged Corpus

This paper presents a procedure in building a Thai part-of-speech (POS) tagged corpus named ORCHID [1]. It is a collaboration project between Communications Research Laboratory (CRL) of Japan and National Electronics and Computer Technology Center (NECTEC) of Thailand. We proposed a new tagset based on the previous research on Thai parts-of-speech for using in a multi-lingual machine translatio...

Distributional Semantics Approach to Thai Word Sense Disambiguation

Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Many strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are: knowledge-based, corpus-based, and hybrid-based. This paper pays attention to the corpus-based strategy...

Curcin Intoxication in Royal Thai Army Privates: A Case Report

Background: Jatropha curcas, commonly known as “Saboo Dum,” is the most common cause of plant poisoning in Thailand. Saboo Dum seeds are used as raw material for biodiesel fuel manufacture, especially in Royal Thai Military units. The seed contains a toxin called curcin which can cause hepatotoxicity in humans. Case Presentation: We reported twenty-eight private soldiers who were brought to the emergency...


Publication date: 2009